CPSign: conformal prediction for cheminformatics modeling

Conformal prediction has seen many applications in pharmaceutical science, as it can calibrate the outputs of machine learning models and produce valid prediction intervals. Here we present CPSign, open source software that provides a complete implementation of conformal prediction for cheminformatics modeling. CPSign implements inductive and transductive conformal prediction for classification and regression, and probabilistic prediction with the Venn-ABERS methodology. The main chemical representation is signatures, but other types of descriptors are also supported. The main modeling methodology is support vector machines (SVMs), but additional modeling methods are supported via an extension mechanism, e.g. DeepLearning4J models. We also describe features for visualizing results from conformal models, including calibration and efficiency plots, as well as features for publishing predictive models as REST services. We compare CPSign against other common cheminformatics modeling approaches, including random forest and a directed message-passing neural network. The results show that CPSign produces robust predictive performance with comparable predictive efficiency, with superior runtime and lower hardware requirements compared to neural network based models. CPSign has been used in several studies and is in production use in multiple organizations. The ability to work directly with chemical input files and to perform descriptor calculation and SVM modeling in the conformal prediction framework, within a single software package that has a low footprint and fast execution time, makes CPSign a convenient yet flexible package for training, deploying, and predicting on chemical data. CPSign can be downloaded from GitHub at https://github.com/arosbio/cpsign.

Scientific contribution
CPSign provides a single software package that allows users to perform data preprocessing and modeling, and to make predictions directly on chemical structures, using conformal and probabilistic prediction. Building and evaluating new models can be achieved at a high abstraction level without sacrificing flexibility or predictive performance. This is showcased with a method evaluation against contemporary modeling approaches, where CPSign performs on par with a state-of-the-art deep learning based model.

Supplementary Information
The online version contains supplementary material available at 10.1186/s13321-024-00870-9.

Figure 1: Calibration results aggregated for all evaluated methods and datasets, panels A-C for classification models and panels D-F for regression models. For classification, the calibration is analyzed independently per class (I: inactive, A: active), where the active class is the minority class for all datasets (Table 1, main article). The classification values were based on 30 significance levels (0.01, 0.02, ..., 0.3), whereas the regression values were based on six levels (0.05, 0.1, ..., 0.3). Panels A and D display "max diff", corresponding to the expression $\max_{\varepsilon}\{\text{error rate}_{\varepsilon} - \varepsilon\}$, i.e. the largest signed difference between error rate and significance level. A negative value means that the error rate was smaller than the significance level across all tested significance levels, and a positive value means that the error rate exceeded the significance level by at most that difference (smaller values are preferable). Panels B and E display the root mean squared error (RMSE) between the significance level and the error rate (smaller values are preferable). Panels C and F display the "capped" RMSE, in which the error rate is capped at the significance level if it is lower than the significance level (for every evaluated significance level), so that over-conservative predictions (i.e. a lower error rate than required) do not contribute to a higher RMSE.
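The three calibration summaries described in this caption can be written out in a few lines. The following is a minimal sketch in plain Python, not code taken from CPSign; the function name `calibration_metrics` and the example numbers are hypothetical and only illustrate the definitions of "max diff", RMSE and capped RMSE.

```python
import math


def calibration_metrics(significance_levels, error_rates):
    """Return (max_diff, rmse, capped_rmse) for paired significance levels
    and observed error rates. Hypothetical helper, for illustration only."""
    if not significance_levels or len(significance_levels) != len(error_rates):
        raise ValueError("need two non-empty sequences of equal length")

    # Signed difference: positive when the error rate exceeds the significance level.
    diffs = [err - eps for eps, err in zip(significance_levels, error_rates)]
    max_diff = max(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    # Capped variant: an error rate below the significance level is raised to
    # that level, so over-conservative predictions contribute zero error.
    capped = [max(d, 0.0) for d in diffs]
    capped_rmse = math.sqrt(sum(d * d for d in capped) / len(capped))
    return max_diff, rmse, capped_rmse


# Made-up example values, not results from the article.
levels = [0.1, 0.2, 0.3]
errors = [0.08, 0.22, 0.30]
max_diff, rmse, capped_rmse = calibration_metrics(levels, errors)
print(f"max diff={max_diff:.4f}  RMSE={rmse:.4f}  capped RMSE={capped_rmse:.4f}")
```

With these made-up values only the middle level is violated, so the capped RMSE is smaller than the plain RMSE, which also counts the over-conservative first level.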

Table 1: Summary of some of the features and available configurations within CPSign. Bold-faced words are the default for the given row/item in the table. Note that all of these, except the predictor type, can be extended and injected with custom implementations. Further note that the item "Data transformations" lists types of transformations, and there can be several implementations to choose from for each type.

Performance ranking of the modeling methods in the comparison. The "Top" column shows the number of datasets for which each method produced the most efficient predictions, whereas the "Rank" column displays the sum of ranks across all datasets. The best value in each column is displayed in bold. OF: Observed Fuzziness. A lower OF score is preferable. Larger differences can be seen when analyzing the datasets individually, and each modeling method produces the best predictions for at least one dataset (see Table 2).

Table 3: Grid of predicted atom contributions for ten recent cancer drugs, based on six of the regression datasets and using the CPSign method. Each image uses a blue-red coloring scheme, where blue indicates atoms that are part of features (substructures) contributing towards a lower predicted value, and red indicates the opposite. We can see that each model finds different substructures to be the most important for its prediction (i.e., looking row-wise in the grid) and that the prediction intervals differ in width between drugs (i.e., looking column-wise). The prediction width is based on the estimated difficulty of predicting the drug, as given by the error model.
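To illustrate how an error model can make interval widths differ between compounds, the sketch below shows a generic normalized inductive conformal regression interval in plain Python. It is an illustration under stated assumptions, not CPSign's implementation: the function name `normalized_interval`, the `difficulty` argument (standing in for the error model's output) and all numbers are hypothetical.

```python
import math


def normalized_interval(y_hat, difficulty, calib_scores, significance):
    """Prediction interval for one compound at the given significance level.

    calib_scores: normalized nonconformity scores |y - y_hat| / difficulty
                  computed on a held-out calibration set (assumed given).
    difficulty:   positive difficulty estimate for the new compound, e.g. the
                  output of a separate error model (an assumption here).
    """
    n = len(calib_scores)
    # Rank of the calibration score that acts as the (1 - significance) quantile.
    k = math.ceil((n + 1) * (1.0 - significance))
    if k > n:
        # Too few calibration examples for this confidence: uninformative interval.
        return (-math.inf, math.inf)
    alpha = sorted(calib_scores)[k - 1]
    half_width = alpha * difficulty  # harder compounds get wider intervals
    return (y_hat - half_width, y_hat + half_width)


# Made-up calibration scores 0.1, 0.2, ..., 0.9 and two hypothetical compounds
# with the same point prediction but different estimated difficulty.
scores = [i / 10 for i in range(1, 10)]
easy = normalized_interval(2.0, 0.5, scores, significance=0.2)
hard = normalized_interval(2.0, 2.0, scores, significance=0.2)
print(easy, hard)
```

Both compounds share the same calibration quantile, so the difference in interval width comes entirely from the difficulty estimate.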


Figure 2: Calibration curves for the classification datasets, showing one curve for each class (I: inactive, A: active). The active class is the minority class for all datasets and displays worse calibration, which can be seen more easily in the aggregation in Figure 1. All calibration curves were evaluated at the significance levels 0.01, 0.02, ..., 0.3.

Figure 5: Median fraction of single-label predictions for the evaluated methods for all sixteen datasets, based on the three significance levels 0.1, 0.2 and 0.3 (corresponding to confidence levels of 90 %, 80 % and 70 %, respectively). A higher fraction is preferable. As in Figure 4, the prediction results display larger differences than the aggregated results, with different methods performing best on different datasets.
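The single-label fraction can be made concrete with a short sketch. In conformal classification, a label is included in the prediction set when its p-value exceeds the significance level, and a prediction counts as single-label when exactly one label remains. The plain Python code below is illustrative only: the function names and the p-values are hypothetical, not CPSign output.

```python
def prediction_set(p_values, significance):
    """Labels whose conformal p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


def single_label_fraction(all_p_values, significance):
    """Fraction of predictions whose prediction set contains exactly one label."""
    sets = [prediction_set(p, significance) for p in all_p_values]
    return sum(len(s) == 1 for s in sets) / len(sets)


# Made-up p-values for four compounds (A: active, I: inactive).
p_values = [
    {"A": 0.05, "I": 0.60},
    {"A": 0.25, "I": 0.35},
    {"A": 0.15, "I": 0.02},
    {"A": 0.08, "I": 0.07},
]
for eps in (0.1, 0.2, 0.3):
    print(eps, single_label_fraction(p_values, eps))
```

Raising the significance level shrinks every prediction set, which can turn two-label sets into single-label ones but also single-label sets into empty ones, so the fraction need not change monotonically.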

Figure 7: Runtime comparison for individual classification datasets, using a logarithmic y-axis. The relative runtimes are consistent across all runs, except for the PCBA-686978 dataset (where CPSign and CPSign tuned were run with linear SVM kernels). Note that the two Chemprop methods do not contain a separate step for computing descriptors, which is instead included in the tuning and training steps.